CNV-80608: add CRD support and create alert rule API #841
sradco wants to merge 1 commit into openshift:main from alerts-management-api
Conversation
Walkthrough
This pull request introduces a comprehensive alert-rule management API for Kubernetes/OpenShift monitoring. The changes add support for creating user-defined and platform alert rules through HTTP endpoints, with Kubernetes resource management, conflict detection, external-management enforcement, and extensive validation. Build configuration updates enable compilation against the new internal packages.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
@sradco: This pull request references CNV-80608, which is a valid Jira issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: sradco. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Hi @jgbernalp, @jan--f, @simonpasquier, I restructured the code so this PR includes the first API,
Force-pushed from 9688dd6 to 2767365 (Compare)
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from 2767365 to a717988 (Compare)
Add AlertingRule, AlertRelabelConfig, and RelabeledRules CRD interfaces with the management client, router, server wiring, and POST /api/v1/alerting/rules endpoint. Signed-off-by: Shirly Radco <sradco@redhat.com> Signed-off-by: João Vilaça <jvilaca@redhat.com> Signed-off-by: Aviv Litman <alitman@redhat.com> Co-authored-by: AI Assistant <noreply@cursor.com>
Force-pushed from a717988 to 80147f1 (Compare)
@sradco: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 8
🧹 Nitpick comments (8)
pkg/k8s/alert_relabel_config.go (1)
95-102: Inconsistent error message in `Delete`: missing namespace. The `Create` and `Update` methods include both namespace and name in their error messages (`%s/%s`), but `Delete` only includes the name. This inconsistency could make debugging harder when multiple namespaces contain resources with the same name.

Proposed fix:

```diff
 func (arcm *alertRelabelConfigManager) Delete(ctx context.Context, namespace string, name string) error {
 	err := arcm.clientset.MonitoringV1().AlertRelabelConfigs(namespace).Delete(ctx, name, metav1.DeleteOptions{})
 	if err != nil {
-		return fmt.Errorf("failed to delete AlertRelabelConfig %s: %w", name, err)
+		return fmt.Errorf("failed to delete AlertRelabelConfig %s/%s: %w", namespace, name, err)
 	}
 	return nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/k8s/alert_relabel_config.go` around lines 95-102: the Delete method on alertRelabelConfigManager returns an error message that only includes the resource name; update alertRelabelConfigManager.Delete to include both namespace and name in the error string (matching the Create/Update pattern) by formatting the error as "%s/%s" (namespace/name) when wrapping the returned error from arcm.clientset.MonitoringV1().AlertRelabelConfigs(namespace).Delete; ensure the error message text mirrors the existing style used in Create/Update for consistency.

internal/managementrouter/create_alert_rule.go (1)
21-26: Consider limiting request body size. The handler decodes the request body without size limits. A malicious client could send an extremely large payload, potentially causing memory exhaustion. Consider using `http.MaxBytesReader` to limit the request body size.

Proposed fix:

```diff
 func (hr *httpRouter) CreateAlertRule(w http.ResponseWriter, req *http.Request) {
+	// Limit request body to 1MB to prevent memory exhaustion
+	req.Body = http.MaxBytesReader(w, req.Body, 1<<20)
+
 	var payload CreateAlertRuleRequest
 	if err := json.NewDecoder(req.Body).Decode(&payload); err != nil {
 		writeError(w, http.StatusBadRequest, "invalid request body")
 		return
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/managementrouter/create_alert_rule.go` around lines 21-26: the CreateAlertRule handler decodes req.Body without any size limit, risking memory exhaustion; wrap the request body with http.MaxBytesReader (e.g., a maxBytes constant like 1<<20 for 1MB) before decoding in CreateAlertRule on the httpRouter, then decode into CreateAlertRuleRequest from that wrapped reader and handle a size-exceeded error by calling writeError(w, http.StatusRequestEntityTooLarge, "...") (or an appropriate message) so overly large payloads are rejected gracefully.

internal/managementrouter/router.go (1)
62-64: Consider limiting error details in 500 responses. The current implementation returns the full internal error message to the client on unexpected errors. While this aids debugging, it may leak implementation details (e.g., database connection strings, internal paths) in production.
Proposed fix (return a generic message to the client):

```diff
 log.Printf("An unexpected error occurred: %v", err)
-return http.StatusInternalServerError, fmt.Sprintf("An unexpected error occurred: %s", err.Error())
+return http.StatusInternalServerError, "An unexpected error occurred"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/managementrouter/router.go` around lines 62-64: the handler currently returns the full internal error string in 500 responses; update the code that returns http.StatusInternalServerError (the block using log.Printf("An unexpected error occurred: %v", err) and return http.StatusInternalServerError, fmt.Sprintf("An unexpected error occurred: %s", err.Error())) to keep detailed error logging server-side but send a generic client message instead (e.g., log the full err with log.Printf or a logger, then return http.StatusInternalServerError and a constant message like "An unexpected error occurred" without including err.Error()) so internal details aren't leaked.

pkg/alert_rule/alert_rule.go (1)
69-71: Minor: redundant `TrimSpace` call. `key` was already trimmed on line 61. The second `TrimSpace` on line 69 is unnecessary.

♻️ Suggested simplification:

```diff
-	if !promLabelNameRegexp.MatchString(strings.TrimSpace(key)) {
+	if !promLabelNameRegexp.MatchString(key) {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/alert_rule/alert_rule.go` around lines 69-71: the second strings.TrimSpace call is redundant because key was already trimmed earlier; in the block using promLabelNameRegexp.MatchString, replace strings.TrimSpace(key) with key to avoid unnecessary allocation while preserving behavior; look for the check involving promLabelNameRegexp and the variable key in alert_rule.go and remove the extra TrimSpace.

pkg/management/create_platform_alert_rule.go (1)
101-119: Consider documenting the alert name conflict check behavior.
`addRuleToGroup` checks for alert name conflicts (line 107) but not for spec equivalence (unlike `CreateUserDefinedAlertRule`, which checks both). This means two platform rules with different names but identical specs could potentially be added. The duplicate check on lines 29-31 via `RelabeledRules().Get()` should catch exact ID matches, but the relabeled rules cache is eventually consistent. If this asymmetry is intentional, a brief comment explaining the design choice would help future maintainers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/management/create_platform_alert_rule.go` around lines 101-119: the addRuleToGroup function only prevents duplicate alert names but not duplicate rule specs (unlike CreateUserDefinedAlertRule which checks spec equivalence), creating an intentional asymmetry due to eventual consistency of RelabeledRules().Get(); update the code by adding a concise in-code comment inside addRuleToGroup (near the name-conflict check and before appending the new group) that explains this design choice: that we only check the Alert name here, that spec equivalence is validated elsewhere (CreateUserDefinedAlertRule and via RelabeledRules().Get()), and that the relabeled cache is eventually consistent so duplicate-spec prevention is handled at a different layer rather than here; reference addRuleToGroup and RelabeledRules().Get() in the comment so future maintainers can locate the related logic.

pkg/management/create_user_defined_alert_rule.go (1)
97-124: Consider normalizing expression comparison for equivalence check. `rulesHaveEquivalentSpec` compares expressions via `a.Expr.String() != b.Expr.String()`, which is sensitive to whitespace differences. However, `GetAlertingRuleId` normalizes expressions (collapses whitespace) before hashing.

This means two rules with equivalent normalized expressions but different whitespace would:
- Have the same computed ID (duplicate detected at lines 32-35)
- But not be flagged as equivalent spec here (lines 100-102)

If this is intentional (exact byte-for-byte match for spec equivalence), it's fine. Otherwise, consider reusing `normalizeExpr`:

```diff
+import alertrule "github.com/openshift/monitoring-plugin/pkg/alert_rule"
+
 func rulesHaveEquivalentSpec(a, b monitoringv1.Rule) bool {
-	if a.Expr.String() != b.Expr.String() {
+	if alertrule.NormalizeExpr(a.Expr.String()) != alertrule.NormalizeExpr(b.Expr.String()) {
 		return false
 	}
```

Note: This would require exporting `normalizeExpr` from the `alertrule` package.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/management/create_user_defined_alert_rule.go` around lines 97-124: rulesHaveEquivalentSpec currently compares PromQL expressions with a raw string comparison (a.Expr.String() != b.Expr.String()), which is whitespace-sensitive and can mismatch rules that GetAlertingRuleId treats as identical; change rulesHaveEquivalentSpec to compare normalized expressions by applying the same normalizeExpr used by GetAlertingRuleId (or make normalizeExpr exported from the alertrule package and call alertrule.NormalizeExpr here), then compare the normalized strings; ensure imports are updated and keep existing checks for For and filterBusinessLabels unchanged so equivalence now uses normalized expressions.

pkg/k8s/alerting_rule.go (1)
100-107: Consider idempotent delete behavior for NotFound errors. The `Delete` method returns an error when the resource is not found, whereas `Get` returns `(nil, false, nil)` for the same case. For idempotent delete semantics (common in reconciliation loops), you might want to suppress `NotFound` errors:

```diff
 func (arm *alertingRuleManager) Delete(ctx context.Context, name string) error {
 	err := arm.clientset.MonitoringV1().AlertingRules(ClusterMonitoringNamespace).Delete(ctx, name, metav1.DeleteOptions{})
 	if err != nil {
+		if errors.IsNotFound(err) {
+			return nil
+		}
 		return fmt.Errorf("failed to delete AlertingRule %s/%s: %w", ClusterMonitoringNamespace, name, err)
 	}
 	return nil
 }
```

If callers need to know whether the resource existed, this change would be inappropriate; keep the current behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/k8s/alerting_rule.go` around lines 100-107: the Delete method on alertingRuleManager returns an error when the resource is NotFound; change it to be idempotent by checking the delete error from arm.clientset.MonitoringV1().AlertingRules(ClusterMonitoringNamespace).Delete(...) and returning nil if it is a NotFound error (use apierrors.IsNotFound). Ensure you add the apierrors import (k8s.io/apimachinery/pkg/api/errors) and keep returning wrapped errors for other failure cases so only NotFound is suppressed.

pkg/management/create_platform_alert_rule_test.go (1)
97-123: Consider adding a test for operator-managed AlertingRule rejection. The test suite covers GitOps-managed rejection (lines 98-123), but there is no test case for operator-managed AlertingRule rejection. The implementation in `create_platform_alert_rule.go` (lines 47-51) handles both cases, but only one is tested.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/management/create_platform_alert_rule_test.go` around lines 97 - 123, Add a test case parallel to the GitOps-managed one that verifies CreatePlatformAlertRule rejects operator-managed AlertingRule: in pkg/management/create_platform_alert_rule_test.go add an It block under the existing Context("when target AlertingRule exists") that configures mockK8s.AlertingRulesFunc to return an osmv1.AlertingRule whose ObjectMeta contains the operator-management annotation/key that create_platform_alert_rule.go checks (the same annotation/key used in the implementation), call client.CreatePlatformAlertRule(ctx, baseRule), and assert the call returns an error containing the same "managed by" rejection message; mirror the structure of the existing GitOps test and use the same MockAlertingRuleInterface GetFunc pattern so the operator-managed branch in CreatePlatformAlertRule is exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@go.mod`:
- Line 3: go.mod currently declares "go 1.25.0" which mismatches the project
builder images referencing Go 1.24; update either the builder images to use Go
1.25 or revert the go directive in go.mod to 1.24 so CI and dev containers use
the same toolchain. Locate the go directive in go.mod ("go 1.25.0") and change
it to "go 1.24" to match the Dockerfiles (or update the Dockerfiles to use
openshift-golang-builder with 1.25 consistently), then run go mod tidy and
verify builds locally to ensure the change resolves the mismatch.
In `@pkg/k8s/relabeled_rules.go`:
- Around line 418-427: The List method on relabeledRulesManager currently
returns shallow copies of cached monitoringv1.Rule objects (rrm.relabeledRules),
exposing shared Labels/Annotations/For maps and *relabel.Config pointers; change
it to return deep copies: while holding the lock, for each rule produce a full
deep copy (use the Rule.DeepCopy() helper if available, otherwise copy primitive
fields and make new maps for Labels/Annotations, new slices for For/Exprs, and
clone each relabel.Config into a new instance) and append the deep copy to the
result slice; apply the same deep-copy approach to the other cache-returning
method(s) referenced around lines 430-446 (e.g., any List*/Get* that returns
cached rules or relabel configs) so callers never receive shared internal
pointers or maps after the lock is released.
- Around line 142-150: After cache sync returns, ensure the manager performs an
initial successful sync before being exposed: call the existing sync(ctx) (or
trigger/wait for the first worker pass) after launching rrm.worker(ctx) and
before returning from the constructor so relabeledRules is populated;
specifically, replace or augment the current rrm.queue.Add("initial-sync")
behavior by invoking rrm.sync(ctx) (or blocking until rrm.worker's first pass
completes) and only return the manager once that initial sync completes
successfully to prevent early POST requests from missing existing-rule
conflicts.
- Around line 269-317: The loop is mutating the informer-cached PrometheusRule
because `rule` is a shallow copy and writes to `rule.Labels` modify the
original; fix by cloning the labels map before any writes: in the loop over
`group.Rules`, if `rule.Labels` is non-nil create a new map and copy all
key/value pairs into it (otherwise allocate a new map), assign that new map back
to `rule.Labels`, then proceed to add synthetic labels (e.g. setting
managementlabels.AlertNameLabel, AlertRuleLabelId,
PrometheusRuleLabelNamespace/Name and managed-by labels) and run relabel
processing (`labels.NewBuilder` / `relabel.ProcessBuilder`) against the cloned
map so you never mutate the informer-cached object.
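The clone-before-write fix described above reduces to a copy-on-write of the labels map. A minimal stand-alone sketch of the idea (the label key below is illustrative, not one of the real `managementlabels` constants):

```go
package main

import "fmt"

// cloneLabels copies a labels map so synthetic labels can be added without
// mutating the informer-cached rule; a nil input yields a fresh empty map.
func cloneLabels(in map[string]string) map[string]string {
	out := make(map[string]string, len(in)+4) // headroom for synthetic labels
	for k, v := range in {
		out[k] = v
	}
	return out
}

func main() {
	cached := map[string]string{"severity": "warning"} // pretend this is informer-owned
	local := cloneLabels(cached)
	local["alert-rule-id"] = "abc123" // hypothetical synthetic label

	fmt.Println(cached["alert-rule-id"] == "") // true: the cached map is untouched
	fmt.Println(local["severity"])             // warning
}
```

The same pattern applies to the `List` deep-copy fix above: any map, slice, or pointer handed out of the cache must be a fresh copy, because client-go treats informer-cached objects as read-only shared state.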
In `@pkg/management/alert_rule_preconditions.go`:
- Around line 80-98: validatePlatformUpdatePreconditions currently ignores the
operatorManaged boolean from k8s.IsExternallyManagedObject and doesn't check
operator-managed status for arc, allowing operator-managed resources to be
updated; update the function to mirror validatePlatformDeletePreconditions by
capturing both return values from k8s.IsExternallyManagedObject(pr) and
returning the same rejection when operatorManaged is true, and add an
operator-managed check for arc (using the corresponding k8s helper that checks
annotations/labels) to return the same not-allowed error when arc is
operator-managed.
In `@pkg/management/management.go`:
- Around line 13-14: The current IsPlatformAlertRule(prId types.NamespacedName)
only checks Namespace().IsClusterMonitoringNamespace and ignores the specific
PrometheusRule identity, causing false positives; update IsPlatformAlertRule to
fetch the PrometheusRule identified by prId (e.g., via
c.k8sClient.Get/PrometheusRule client), then determine platform ownership from
the rule’s metadata (ownerReferences where Controller==true and kind indicates
platform controller, or a platform-specific annotation such as a
managed-by/owned-by key), and only return true when that explicit provenance
indicates platform ownership; if fetching the resource is impossible, fall back
to a stricter namespace policy (e.g., a whitelist of known platform namespaces)
rather than the broad cluster-monitoring label; update any callers like
CreateUserDefinedAlertRule to use the revised IsPlatformAlertRule
signature/behavior.
In `@pkg/management/testutils/k8s_client_mock.go`:
- Around line 16-23: The accessors on MockClient (PrometheusRulesFunc,
AlertRelabelConfigsFunc, AlertingRulesFunc, RelabeledRulesFunc, NamespaceFunc)
currently return a new mock on each call; change them to lazily initialize and
memoize one backing mock instance per interface on the MockClient struct (e.g.,
add fields like prometheusRules, alertRelabelConfigs, alertingRules,
relabeledRules, namespace) and have each accessor check the backing field,
create and store the mock if nil, then return the stored instance so repeated
calls (e.g., AlertingRules().Create(...) followed by AlertingRules().Get(...))
operate on the same mock store; keep TestConnectionFunc behavior unchanged.
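The memoized-accessor fix can be sketched with a simplified stand-in for the generated mocks (`mockStore` here replaces the real per-interface mock types; only the lazy-init-and-reuse shape matters):

```go
package main

import "fmt"

// mockStore is a simplified stand-in for one per-interface mock.
type mockStore struct{ items map[string]string }

func (m *mockStore) Create(name, val string) { m.items[name] = val }
func (m *mockStore) Get(name string) (string, bool) {
	v, ok := m.items[name]
	return v, ok
}

// MockClient memoizes one backing mock per interface, so repeated accessor
// calls share state, which is the behavior the review asks for.
type MockClient struct {
	alertingRules *mockStore
}

func (c *MockClient) AlertingRules() *mockStore {
	if c.alertingRules == nil { // lazily initialize once, then reuse
		c.alertingRules = &mockStore{items: map[string]string{}}
	}
	return c.alertingRules
}

func main() {
	c := &MockClient{}
	c.AlertingRules().Create("rule-a", "spec")
	v, ok := c.AlertingRules().Get("rule-a") // same backing store as Create
	fmt.Println(v, ok)                       // spec true
}
```

Without the memoization, `AlertingRules().Create(...)` followed by `AlertingRules().Get(...)` would hit two unrelated mock instances and the Get would always miss.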
In `@pkg/server.go`:
- Around line 153-167: The current synchronous initialization of the Kubernetes
management client (managementClient) uses k8s.NewClient and
k8sClient.TestConnection under the long-lived server context and can block
startup; wrap the creation and connection check in a bounded initialization
(e.g., create a child context with a short timeout or deadline and use that for
k8s.NewClient and k8sClient.TestConnection) and fail fast if the timeout
elapses, or alternatively refactor so NewClient returns a client that starts
informers asynchronously (no cache-sync wait) and performs cache sync in the
background while the server continues to start; update the code paths
referencing managementClient to handle startup-time timeout failures or a
not-yet-synced informer client accordingly (references: managementClient,
k8s.NewClient, k8sClient.TestConnection, pkg/k8s/client.go).
---
Nitpick comments:
In `@internal/managementrouter/create_alert_rule.go`:
- Around line 21-26: The CreateAlertRule handler decodes req.Body without any
size limit, risking memory exhaustion; wrap the request body with
http.MaxBytesReader (e.g., maxBytes constant like 1<<20 for 1MB) before decoding
in CreateAlertRule on the httpRouter, then decode into CreateAlertRuleRequest
from that wrapped reader and handle a size-exceeded error by calling
writeError(w, http.StatusRequestEntityTooLarge, "...") (or appropriate message)
so overly large payloads are rejected gracefully.
In `@internal/managementrouter/router.go`:
- Around line 62-64: The handler currently returns the full internal error
string in 500 responses; update the code that returns
http.StatusInternalServerError (the block using log.Printf("An unexpected error
occurred: %v", err) and return http.StatusInternalServerError, fmt.Sprintf("An
unexpected error occurred: %s", err.Error())) to keep detailed error logging
server-side but send a generic client message instead (e.g., log the full err
with log.Printf or a logger, then return http.StatusInternalServerError and a
constant message like "An unexpected error occurred" without including
err.Error()) so internal details aren’t leaked.
In `@pkg/alert_rule/alert_rule.go`:
- Around line 69-71: The second strings.TrimSpace call is redundant because key
was already trimmed earlier; in the block using promLabelNameRegexp.MatchString
replace strings.TrimSpace(key) with key to avoid unnecessary allocation while
preserving behavior—look for the check involving promLabelNameRegexp and the
variable key in alert_rule.go and remove the extra TrimSpace.
In `@pkg/k8s/alert_relabel_config.go`:
- Around line 95-102: The Delete method on alertRelabelConfigManager returns an
error message that only includes the resource name; update
alertRelabelConfigManager.Delete to include both namespace and name in the error
string (matching the Create/Update pattern) by formatting the error as "%s/%s"
(namespace/name) when wrapping the returned error from
arcm.clientset.MonitoringV1().AlertRelabelConfigs(namespace).Delete; ensure the
error message text mirrors the existing style used in Create/Update for
consistency.
In `@pkg/k8s/alerting_rule.go`:
- Around line 100-107: The Delete method on alertingRuleManager returns an error
when the resource is NotFound; change it to be idempotent by checking the delete
error from
arm.clientset.MonitoringV1().AlertingRules(ClusterMonitoringNamespace).Delete(...)
and return nil if it is a NotFound error (use apierrors.IsNotFound). Ensure you
add the apierrors import (k8s.io/apimachinery/pkg/api/errors) and keep returning
wrapped errors for other failure cases so only NotFound is suppressed.
In `@pkg/management/create_platform_alert_rule_test.go`:
- Around line 97-123: Add a test case parallel to the GitOps-managed one that
verifies CreatePlatformAlertRule rejects operator-managed AlertingRule: in
pkg/management/create_platform_alert_rule_test.go add an It block under the
existing Context("when target AlertingRule exists") that configures
mockK8s.AlertingRulesFunc to return an osmv1.AlertingRule whose ObjectMeta
contains the operator-management annotation/key that
create_platform_alert_rule.go checks (the same annotation/key used in the
implementation), call client.CreatePlatformAlertRule(ctx, baseRule), and assert
the call returns an error containing the same "managed by" rejection message;
mirror the structure of the existing GitOps test and use the same
MockAlertingRuleInterface GetFunc pattern so the operator-managed branch in
CreatePlatformAlertRule is exercised.
In `@pkg/management/create_platform_alert_rule.go`:
- Around line 101-119: The addRuleToGroup function only prevents duplicate alert
names (in addRuleToGroup) but not duplicate rule specs (unlike
CreateUserDefinedAlertRule which checks spec equivalence), creating an
intentional asymmetry due to eventual consistency of RelabeledRules().Get();
update the code by adding a concise in-code comment inside addRuleToGroup (near
the name-conflict check and before appending the new group) that explains this
design choice: that we only check Alert name here, that spec equivalence is
validated elsewhere (CreateUserDefinedAlertRule and via RelabeledRules().Get()),
and that the relabeled cache is eventually consistent so duplicate-spec
prevention is handled at a different layer rather than here; reference
addRuleToGroup and RelabeledRules().Get() in the comment so future maintainers
can locate the related logic.
In `@pkg/management/create_user_defined_alert_rule.go`:
- Around line 97-124: rulesHaveEquivalentSpec currently compares PromQL
expressions with a raw string comparison (a.Expr.String() != b.Expr.String()),
which is whitespace-sensitive and can mismatch rules that GetAlertingRuleId
treats as identical; change rulesHaveEquivalentSpec to compare normalized
expressions by applying the same normalizeExpr used by GetAlertingRuleId (or
make normalizeExpr exported from the alertrule package and call
alertrule.NormalizeExpr here), then compare the normalized strings; ensure
imports are updated and keep existing checks for For and filterBusinessLabels
unchanged so equivalence now uses normalized expressions.
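The whitespace normalization this prompt refers to can be sketched as follows. This is a hypothetical version of `normalizeExpr` (trim, then collapse whitespace runs); note that collapsing whitespace inside PromQL string literals would be unsafe, so a production implementation should parse the expression rather than regex it:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var whitespaceRe = regexp.MustCompile(`\s+`)

// normalizeExpr is a hypothetical sketch of the normalization the review
// describes: trim the expression and collapse runs of whitespace so that
// formatting differences do not affect comparison.
func normalizeExpr(expr string) string {
	return whitespaceRe.ReplaceAllString(strings.TrimSpace(expr), " ")
}

// exprsEquivalent compares two expressions after normalization, mirroring the
// suggested change to rulesHaveEquivalentSpec.
func exprsEquivalent(a, b string) bool {
	return normalizeExpr(a) == normalizeExpr(b)
}

func main() {
	a := "sum(rate(http_requests_total[5m]))  >  10"
	b := "sum(rate(http_requests_total[5m])) > 10"
	fmt.Println(a == b)                // false: raw comparison is whitespace-sensitive
	fmt.Println(exprsEquivalent(a, b)) // true: normalized forms match
}
```

With both the ID computation and the equivalence check funneled through one normalization function, the "same ID but different spec" inconsistency the review describes cannot arise.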
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 02cd0c02-2be5-4abb-bf99-14c9b7e5a79e
⛔ Files ignored due to path filters (1)
`go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (34)
Dockerfile, Dockerfile.dev, Dockerfile.dev-mcp, Dockerfile.devspace, Dockerfile.konflux, Dockerfile.mcp, Makefile, go.mod, internal/managementrouter/create_alert_rule.go, internal/managementrouter/create_alert_rule_test.go, internal/managementrouter/managementrouter_suite_test.go, internal/managementrouter/query_filters.go, internal/managementrouter/router.go, pkg/alert_rule/alert_rule.go, pkg/k8s/alert_relabel_config.go, pkg/k8s/alerting_rule.go, pkg/k8s/client.go, pkg/k8s/external_management.go, pkg/k8s/relabeled_rules.go, pkg/k8s/types.go, pkg/management/alert_rule_id_match.go, pkg/management/alert_rule_preconditions.go, pkg/management/client_factory.go, pkg/management/create_platform_alert_rule.go, pkg/management/create_platform_alert_rule_test.go, pkg/management/create_user_defined_alert_rule.go, pkg/management/create_user_defined_alert_rule_test.go, pkg/management/errors.go, pkg/management/label_utils.go, pkg/management/management.go, pkg/management/testutils/k8s_client_mock.go, pkg/management/types.go, pkg/managementlabels/management_labels.go, pkg/server.go
```diff
 module github.com/openshift/monitoring-plugin

-go 1.24.0
+go 1.25.0
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
echo "== go.mod directive =="
rg -n '^go\s+[0-9]+\.[0-9]+' go.mod
echo
echo "== Go versions referenced by Dockerfiles =="
fd '^Dockerfile(\..+)?$' -t f | while read -r f; do
  echo "--- ${f}"
  rg -n 'openshift-golang-builder|golang_[0-9]+\.[0-9]+|go[0-9]+\.[0-9]+' "$f" || true
done
```

Repository: openshift/monitoring-plugin
Length of output: 914
Align go.mod toolchain with builder images before merge.
go.mod line 3 specifies Go 1.25.0, but Dockerfiles (Dockerfile.dev, Dockerfile.dev-mcp, Dockerfile.devspace, Dockerfile.konflux, Dockerfile.mcp) all reference openshift-golang-builder:rhel_9_golang_1.24 or rhel_9_1.24. This version mismatch will cause build failures in CI and dev containers.
Either bump all builder images to Go 1.25 or revert the go directive in go.mod to 1.24.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@go.mod` at line 3, go.mod currently declares "go 1.25.0" which mismatches the
project builder images referencing Go 1.24; update either the builder images to
use Go 1.25 or revert the go directive in go.mod to 1.24 so CI and dev
containers use the same toolchain. Locate the go directive in go.mod ("go
1.25.0") and change it to "go 1.24" to match the Dockerfiles (or update the
Dockerfiles to use openshift-golang-builder with 1.25 consistently), then run go
mod tidy and verify builds locally to ensure the change resolves the mismatch.
```go
if !cache.WaitForNamedCacheSync("RelabeledRulesConfig informer", ctx.Done(),
	rrm.prometheusRulesInformer.HasSynced,
	rrm.secretInformer.HasSynced,
) {
	return nil, fmt.Errorf("failed to sync RelabeledRulesConfig informer")
}

go rrm.worker(ctx)
rrm.queue.Add("initial-sync")
```
Warm the cache before exposing this manager.
After `WaitForNamedCacheSync`, `relabeledRules` is still empty until the worker eventually consumes `"initial-sync"`. Early `POST /api/v1/alerting/rules` requests can therefore miss existing-rule conflicts right after startup. Run one successful `sync(ctx)` (or wait for the first worker pass) before returning.
🔧 Example fix

```diff
-go rrm.worker(ctx)
-rrm.queue.Add("initial-sync")
+if err := rrm.sync(ctx); err != nil {
+	return nil, fmt.Errorf("initial relabeled-rules sync: %w", err)
+}
+
+go rrm.worker(ctx)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
if !cache.WaitForNamedCacheSync("RelabeledRulesConfig informer", ctx.Done(),
	rrm.prometheusRulesInformer.HasSynced,
	rrm.secretInformer.HasSynced,
) {
	return nil, fmt.Errorf("failed to sync RelabeledRulesConfig informer")
}

if err := rrm.sync(ctx); err != nil {
	return nil, fmt.Errorf("initial relabeled-rules sync: %w", err)
}

go rrm.worker(ctx)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/k8s/relabeled_rules.go` around lines 142 - 150, After cache sync returns,
ensure the manager performs an initial successful sync before being exposed:
call the existing sync(ctx) (or trigger/wait for the first worker pass) after
launching rrm.worker(ctx) and before returning from the constructor so
relabeledRules is populated; specifically, replace or augment the current
rrm.queue.Add("initial-sync") behavior by invoking rrm.sync(ctx) (or blocking
until rrm.worker's first pass completes) and only return the manager once that
initial sync completes successfully to prevent early POST requests from missing
existing-rule conflicts.
```go
for _, rule := range group.Rules {
	// Only process alerting rules (skip recording rules)
	if rule.Alert == "" {
		continue
	}

	// Compute a deterministic id from the rule spec.
	// Do not trust any user-provided value in openshift_io_alert_rule_id since
	// PrometheusRule content (including labels) can be tampered with.
	alertRuleId := alertrule.GetAlertingRuleId(&rule)
	if _, exists := seenIDs[alertRuleId]; exists {
		// A second rule that computes to the same id is ambiguous/unsupported (a "true clone").
		// Don't silently overwrite the first rule in the cache.
		log.Warnf("Duplicate alert rule id %q computed for %s/%s (alert=%q); skipping duplicate", alertRuleId, promRule.Namespace, promRule.Name, rule.Alert)
		continue
	}
	seenIDs[alertRuleId] = struct{}{}

	if rule.Labels == nil {
		rule.Labels = make(map[string]string)
	}

	rule.Labels[managementlabels.AlertNameLabel] = rule.Alert

	if rrm.namespaceManager.IsClusterMonitoringNamespace(promRule.Namespace) {
		lb := labels.NewBuilder(labels.FromMap(rule.Labels))
		keep := relabel.ProcessBuilder(lb, relabelConfigs...)
		if !keep {
			log.Infof("Skipping dropped alert %s from %s/%s", rule.Alert, promRule.Namespace, promRule.Name)
			continue
		}

		rule.Labels = lb.Labels().Map()
	}

	rule.Labels[AlertRuleLabelId] = alertRuleId
	rule.Labels[PrometheusRuleLabelNamespace] = promRule.Namespace
	rule.Labels[PrometheusRuleLabelName] = promRule.Name

	if arName := alertingRuleOwner(promRule); arName != "" {
		rule.Labels[managementlabels.AlertingRuleLabelName] = arName
	}

	ruleManagedBy, relabelConfigManagedBy := rrm.determineManagedBy(ctx, promRule, alertRuleId)
	if ruleManagedBy != "" {
		rule.Labels[managementlabels.RuleManagedByLabel] = ruleManagedBy
	}
	if relabelConfigManagedBy != "" {
		rule.Labels[managementlabels.RelabelConfigManagedByLabel] = relabelConfigManagedBy
```
Clone the rule labels before adding synthetic labels.
`rule` is only a shallow copy of the `group.Rules` entry. When the original rule already has a `Labels` map, the writes on lines 291-317 mutate the informer-cached `PrometheusRule` object, which client-go treats as read-only. That can corrupt the shared cache and race with other readers.
🔧 Example fix

```diff
-	for _, group := range promRule.Spec.Groups {
-		for _, rule := range group.Rules {
+	for _, group := range promRule.Spec.Groups {
+		for _, cachedRule := range group.Rules {
+			rule := cachedRule
+			if cachedRule.Labels != nil {
+				rule.Labels = make(map[string]string, len(cachedRule.Labels))
+				for k, v := range cachedRule.Labels {
+					rule.Labels[k] = v
+				}
+			}
```
+ }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/k8s/relabeled_rules.go` around lines 269-317: the loop is mutating the
informer-cached PrometheusRule because `rule` is a shallow copy and writes to
`rule.Labels` modify the original; fix by cloning the labels map before any
writes: in the loop over `group.Rules`, if `rule.Labels` is non-nil create a new
map and copy all key/value pairs into it (otherwise allocate a new map), assign
that new map back to `rule.Labels`, then proceed to add synthetic labels (e.g.
setting managementlabels.AlertNameLabel, AlertRuleLabelId,
PrometheusRuleLabelNamespace/Name and managed-by labels) and run relabel
processing (`labels.NewBuilder` / `relabel.ProcessBuilder`) against the cloned
map so you never mutate the informer-cached object.
```go
func (rrm *relabeledRulesManager) List(ctx context.Context) []monitoringv1.Rule {
	rrm.mu.RLock()
	defer rrm.mu.RUnlock()

	var result []monitoringv1.Rule
	for _, rule := range rrm.relabeledRules {
		result = append(result, rule)
	}

	return result
}
```
Do not return shallow copies of cached rules and configs.
These methods only copy the outer slice/struct. Callers still receive shared `Labels`/`Annotations`/`For` data and the same `*relabel.Config` pointers, so a caller can mutate `rrm` state after the lock is released. Return deep copies before exposing cached data.
Also applies to: 430-446
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/k8s/relabeled_rules.go` around lines 418-427: the List method on
relabeledRulesManager currently returns shallow copies of cached
monitoringv1.Rule objects (rrm.relabeledRules), exposing shared
Labels/Annotations/For maps and *relabel.Config pointers; change it to return
deep copies: while holding the lock, for each rule produce a full deep copy (use
the Rule.DeepCopy() helper if available, otherwise copy primitive fields and
make new maps for Labels/Annotations, new slices for For/Exprs, and clone each
relabel.Config into a new instance) and append the deep copy to the result
slice; apply the same deep-copy approach to the other cache-returning method(s)
referenced around lines 430-446 (e.g., any List*/Get* that returns cached rules
or relabel configs) so callers never receive shared internal pointers or maps
after the lock is released.
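The deep-copy-on-read pattern the prompt asks for can be illustrated with a simplified stand-in type; `Rule` here is a placeholder for `monitoringv1.Rule`, which in the real code has a generated `DeepCopy()` helper that should be preferred over a hand-rolled copy:

```go
package main

import "fmt"

// Rule is a minimal stand-in for monitoringv1.Rule.
type Rule struct {
	Alert  string
	Labels map[string]string
}

// deepCopy clones the rule, including its Labels map.
func (r Rule) deepCopy() Rule {
	out := Rule{Alert: r.Alert}
	if r.Labels != nil {
		out.Labels = make(map[string]string, len(r.Labels))
		for k, v := range r.Labels {
			out.Labels[k] = v
		}
	}
	return out
}

type manager struct {
	cache map[string]Rule
}

// List returns deep copies, so callers cannot mutate the cache after the
// manager's lock would be released.
func (m *manager) List() []Rule {
	result := make([]Rule, 0, len(m.cache))
	for _, r := range m.cache {
		result = append(result, r.deepCopy())
	}
	return result
}

func main() {
	m := &manager{cache: map[string]Rule{
		"id-1": {Alert: "HighLatency", Labels: map[string]string{"severity": "warning"}},
	}}
	got := m.List()
	got[0].Labels["severity"] = "critical" // mutating the returned copy...
	fmt.Println(m.cache["id-1"].Labels["severity"])
	// → warning (the cache is untouched)
}
```

The same copy step applies to any other accessor that returns cached rules or relabel configs.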
```go
// validatePlatformUpdatePreconditions enforces constraints before ARC-based update.
// pr may be nil if not fetched yet; arc may be nil if absent.
func validatePlatformUpdatePreconditions(relabeled monitoringv1.Rule, pr *monitoringv1.PrometheusRule, arc *osmv1.AlertRelabelConfig) error {
	// Rule-level GitOps block
	if isRuleManagedByGitOpsLabel(relabeled) {
		return notAllowedGitOpsEdit()
	}
	// PR metadata GitOps block
	if pr != nil {
		if gitOpsManaged, _ := k8s.IsExternallyManagedObject(pr); gitOpsManaged {
			return notAllowedGitOpsEdit()
		}
	}
	// ARC metadata GitOps block
	if arc != nil && k8s.IsManagedByGitOps(arc.Annotations, arc.Labels) {
		return notAllowedGitOpsEdit()
	}
	return nil
}
```
🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Verify if the operator management check is handled elsewhere for platform updates,
# or if this is an intentional omission.

# Check how validatePlatformUpdatePreconditions is called
rg -n "validatePlatformUpdatePreconditions" --type go -C5

# Check if there's other operator-management blocking for platform updates
rg -n "operatorManaged.*platform" --type go -i
```

Repository: openshift/monitoring-plugin
Length of output: 1116

🏁 Script executed:

```shell
#!/bin/bash
# Read the complete validatePlatformUpdatePreconditions function
cat -n pkg/management/alert_rule_preconditions.go | sed -n '80,98p'
```

Length of output: 911

🏁 Script executed:

```shell
#!/bin/bash
# Find all callers of validatePlatformUpdatePreconditions
rg -n "validatePlatformUpdatePreconditions" --type go -B2 -A10 | head -100
```

Length of output: 1321

🏁 Script executed:

```shell
#!/bin/bash
# Compare with validatePlatformDeletePreconditions - show both functions side by side
echo "=== DELETE PRECONDITIONS ===" && \
cat -n pkg/management/alert_rule_preconditions.go | sed -n '68,78p' && \
echo -e "\n=== UPDATE PRECONDITIONS ===" && \
cat -n pkg/management/alert_rule_preconditions.go | sed -n '80,98p'
```

Length of output: 1415

🏁 Script executed:

```shell
#!/bin/bash
# Check if IsExternallyManagedObject is used consistently throughout the file
rg -n "IsExternallyManagedObject" --type go -B2 -A2
```

Length of output: 3379
`validatePlatformUpdatePreconditions` only checks GitOps management, not operator management.
Line 89 discards the `operatorManaged` return value from `IsExternallyManagedObject()`, while `validatePlatformDeletePreconditions` (lines 68-78) correctly checks both `gitOpsManaged` and `operatorManaged`. This asymmetry allows operator-managed PrometheusRules to be updated via ARC when they should be blocked, which is inconsistent with the delete preconditions that properly reject operator-managed resources.
Either add an operator-management check on the PR and ARC to match the delete logic, or add a comment explaining why operator management is intentionally excluded from platform updates.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/management/alert_rule_preconditions.go` around lines 80-98:
validatePlatformUpdatePreconditions currently ignores the operatorManaged
boolean from k8s.IsExternallyManagedObject and doesn't check operator-managed
status for arc, allowing operator-managed resources to be updated; update the
function to mirror validatePlatformDeletePreconditions by capturing both return
values from k8s.IsExternallyManagedObject(pr) and returning the same rejection
when operatorManaged is true, and add an operator-managed check for arc (using
the corresponding k8s helper that checks annotations/labels) to return the same
not-allowed error when arc is operator-managed.
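The symmetric check the prompt describes, consuming both booleans instead of discarding one, can be sketched as follows. `isExternallyManaged` is a hypothetical stand-in for `k8s.IsExternallyManagedObject`, and the annotation keys below are placeholders, not the plugin's actual markers:

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errGitOpsEdit   = errors.New("rule is managed by GitOps; edit it in the GitOps repository")
	errOperatorEdit = errors.New("rule is managed by an operator; edit it via the operator")
)

// isExternallyManaged stands in for k8s.IsExternallyManagedObject, which
// reports (gitOpsManaged, operatorManaged) from object metadata.
// The annotation keys here are illustrative placeholders.
func isExternallyManaged(annotations map[string]string) (bool, bool) {
	return annotations["example.io/gitops-instance"] != "",
		annotations["example.io/operator-managed"] == "true"
}

// validateUpdate mirrors the delete preconditions: both GitOps- and
// operator-managed objects are rejected.
func validateUpdate(annotations map[string]string) error {
	gitOpsManaged, operatorManaged := isExternallyManaged(annotations)
	if gitOpsManaged {
		return errGitOpsEdit
	}
	if operatorManaged {
		return errOperatorEdit
	}
	return nil
}

func main() {
	// An operator-managed object is now rejected on update, matching delete.
	fmt.Println(validateUpdate(map[string]string{"example.io/operator-managed": "true"}))
}
```

The key change versus the quoted function is simply capturing the second return value and acting on it, rather than discarding it with `_`.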
```go
func (c *client) IsPlatformAlertRule(prId types.NamespacedName) bool {
	return c.k8sClient.Namespace().IsClusterMonitoringNamespace(prId.Namespace)
}
```
Namespace label is too broad for platform-rule detection.
Line 14 ignores `prId.Name` and only checks whether the namespace is labeled `openshift.io/cluster-monitoring=true`. `pkg/k8s/namespace.go` shows that is just a namespace-level monitoring label, not proof that the specific PrometheusRule is platform-owned. Downstream, `CreateUserDefinedAlertRule` uses this to reject writes, so opted-in namespaces will be falsely treated as platform-managed.
Please base this on the actual PrometheusRule provenance/ownership, or at least a narrower platform-namespace policy, instead of the cluster-monitoring label alone.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/management/management.go` around lines 13-14: the current
IsPlatformAlertRule(prId types.NamespacedName) only checks
Namespace().IsClusterMonitoringNamespace and ignores the specific PrometheusRule
identity, causing false positives; update IsPlatformAlertRule to fetch the
PrometheusRule identified by prId (e.g., via c.k8sClient.Get/PrometheusRule
client), then determine platform ownership from the rule’s metadata
(ownerReferences where Controller==true and kind indicates platform controller,
or a platform-specific annotation such as a managed-by/owned-by key), and only
return true when that explicit provenance indicates platform ownership; if
fetching the resource is impossible, fall back to a stricter namespace policy
(e.g., a whitelist of known platform namespaces) rather than the broad
cluster-monitoring label; update any callers like CreateUserDefinedAlertRule to
use the revised IsPlatformAlertRule signature/behavior.
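An ownership check based on controlling owner references, as the prompt suggests, could look roughly like this. `OwnerRef` is a simplified stand-in for `metav1.OwnerReference`, and the allow-list of controller kinds is an assumption for illustration, not the plugin's real policy:

```go
package main

import "fmt"

// OwnerRef is a simplified stand-in for metav1.OwnerReference.
type OwnerRef struct {
	Kind       string
	Controller bool
}

// platformControllerKinds is an assumed allow-list; the real set would come
// from the plugin's knowledge of which operators own platform rules.
var platformControllerKinds = map[string]bool{
	"ClusterOperator": true,
	"AlertingRule":    true,
}

// isPlatformOwned reports whether a PrometheusRule's metadata shows a
// controlling owner that is a known platform controller, rather than
// inferring ownership from a namespace label alone.
func isPlatformOwned(owners []OwnerRef) bool {
	for _, o := range owners {
		if o.Controller && platformControllerKinds[o.Kind] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isPlatformOwned([]OwnerRef{{Kind: "AlertingRule", Controller: true}})) // true
	fmt.Println(isPlatformOwned([]OwnerRef{{Kind: "Deployment", Controller: true}}))   // false
}
```

A user-created rule in an opted-in namespace has no platform controller owner, so it would no longer be misclassified by the namespace label alone.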
```go
type MockClient struct {
	TestConnectionFunc      func(ctx context.Context) error
	PrometheusRulesFunc     func() k8s.PrometheusRuleInterface
	AlertRelabelConfigsFunc func() k8s.AlertRelabelConfigInterface
	AlertingRulesFunc       func() k8s.AlertingRuleInterface
	RelabeledRulesFunc      func() k8s.RelabeledRulesInterface
	NamespaceFunc           func() k8s.NamespaceInterface
}
```
Return stable mock interfaces from `MockClient`.
With the current defaults, each accessor allocates a new empty mock. A test that writes through `client.AlertingRules().Create(...)` and then reads through `client.AlertingRules().Get(...)` will hit two different stores, unlike the real client. Keep one lazily initialized mock instance per interface and return the same object every time.
🔧 Example fix

```diff
 type MockClient struct {
 	TestConnectionFunc      func(ctx context.Context) error
 	PrometheusRulesFunc     func() k8s.PrometheusRuleInterface
 	AlertRelabelConfigsFunc func() k8s.AlertRelabelConfigInterface
 	AlertingRulesFunc       func() k8s.AlertingRuleInterface
 	RelabeledRulesFunc      func() k8s.RelabeledRulesInterface
 	NamespaceFunc           func() k8s.NamespaceInterface
+
+	prometheusRules     k8s.PrometheusRuleInterface
+	alertRelabelConfigs k8s.AlertRelabelConfigInterface
+	alertingRules       k8s.AlertingRuleInterface
+	relabeledRules      k8s.RelabeledRulesInterface
+	namespace           k8s.NamespaceInterface
 }

 func (m *MockClient) PrometheusRules() k8s.PrometheusRuleInterface {
 	if m.PrometheusRulesFunc != nil {
 		return m.PrometheusRulesFunc()
 	}
-	return &MockPrometheusRuleInterface{}
+	if m.prometheusRules == nil {
+		m.prometheusRules = &MockPrometheusRuleInterface{}
+	}
+	return m.prometheusRules
 }
```

Apply the same lazy-init pattern to the other accessors.
Also applies to: 34-70
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/management/testutils/k8s_client_mock.go` around lines 16-23: the
accessors on MockClient (PrometheusRulesFunc, AlertRelabelConfigsFunc,
AlertingRulesFunc, RelabeledRulesFunc, NamespaceFunc) currently return a new
mock on each call; change them to lazily initialize and memoize one backing mock
instance per interface on the MockClient struct (e.g., add fields like
prometheusRules, alertRelabelConfigs, alertingRules, relabeledRules, namespace)
and have each accessor check the backing field, create and store the mock if
nil, then return the stored instance so repeated calls (e.g.,
AlertingRules().Create(...) followed by AlertingRules().Get(...)) operate on the
same mock store; keep TestConnectionFunc behavior unchanged.
```go
	// Initialize management client if management API feature is enabled
	var managementClient management.Client
	if alertManagementAPIMode {
		k8sClient, err := k8s.NewClient(ctx, k8sconfig)
		if err != nil {
			return nil, fmt.Errorf("failed to create k8s client for alert management API: %w", err)
		}

		if err := k8sClient.TestConnection(ctx); err != nil {
			return nil, fmt.Errorf("failed to connect to kubernetes cluster for alert management API: %w", err)
		}

		managementClient = management.New(ctx, k8sClient)
		log.Info("alert management API enabled")
	}
```
Management client init can hang the whole server startup.
On lines 156-165, `k8s.NewClient` runs during server construction under the server-lifetime context. `pkg/k8s/client.go` initializes several informer-backed managers, and those managers wait for cache sync before returning. A slow or unreachable API server, or a missing CRD, can therefore wedge plugin startup indefinitely once this feature flag is enabled.
Please add a bounded initialization path here, or split informer lifetime from init-time sync, so startup fails fast instead of hanging the rollout.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/server.go` around lines 153-167: the current synchronous initialization
of the Kubernetes management client (managementClient) uses k8s.NewClient and
k8sClient.TestConnection under the long-lived server context and can block
startup; wrap the creation and connection check in a bounded initialization
(e.g., create a child context with a short timeout or deadline and use that for
k8s.NewClient and k8sClient.TestConnection) and fail fast if the timeout
elapses, or alternatively refactor so NewClient returns a client that starts
informers asynchronously (no cache-sync wait) and performs cache sync in the
background while the server continues to start; update the code paths
referencing managementClient to handle startup-time timeout failures or a
not-yet-synced informer client accordingly (references: managementClient,
k8s.NewClient, k8sClient.TestConnection, pkg/k8s/client.go).
Add AlertingRule, AlertRelabelConfig, and
RelabeledRules CRD interfaces with the
management client, router, server wiring,
and POST /api/v1/alerting/rules endpoint.
Signed-off-by: Shirly Radco <sradco@redhat.com>
Signed-off-by: João Vilaça <jvilaca@redhat.com>
Signed-off-by: Aviv Litman <alitman@redhat.com>
Co-authored-by: AI Assistant <noreply@cursor.com>